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ABSTRACT 


Experience  with  the  CRAY-2  on  the  effects  of  common  memory  speed  and  loading  on 
performance  indicate  that  local-memory-based  algorithms  have  potentially  a  large 
advantage.  The  performance  of  a  number  of  common-  and  local-memory  algorithms 
are  compared  for  the  LU  factorization  of  a  dense  system  of  equations  on  the  CRAY-2. 
Results  of  both  Fortran  and  assembly  language  implementations  are  given. 
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I.  INTRODUCTION 

A.  CRAY-2  Architecture  and  Algorithm  Implications 

The  CRAY-2  architecture  of  Figure  1  has  several  features  relevant  to  this  algorithm 
study. 

(a)  Common  memory  features.  The  massive  common  memory  (CM)  trades 
size  for  access  time,  so  that  a  considerable  delay  is  encountered  in 
reading  from  CM.  Also,  only  one  data  path  connects  common  memory  to 
each  processor's  functional  units. 

(b)  Local  memory.  The  above  speed  disadvantages  are  compensated  by  a 
local  memory  (LM),  which  serves  as  backup  vector  and  scalar  storage  for 
the  functional  unit's  register  storage. 

(c)  Chaining.  The  CRAY-2  does  not  have  hardware  chaining;  this  must  be 
achieved  by  software  and/or  algorithm  means. 

The  implications  of  distributed  memory  (including  hierarchies  such  as  CM  and  LM)  on 
linear  algebra  algorithm  organization  has  been  studied  since  the  existence  of  paged 
memory  systems  [1[2][3][4].  In  general,  computations  must  be  arranged  so  that  the 
number  of  floating-point  operations  on  data  at  the  low  memory  levels  is  sufficient  to 
warrant  data  transfers  to  these  levels.  This  implies,  for  example,  that  a  matrix-vector 
multiply  •  which  involves  only  two  operations  for  each  matrix  data  element  •  will 
perform  less  efficiently  than  a  matrix-matrix  multiply. 

B.  Review  of  Vector  Linear  Algebra  Algorithms 

The  asymptotic  execution  rate  (MFLOPS)  of  a  factorization  algorithm  is  equal  that  of 
the  kernel  that  perlorms  the  add-multiplies  associated  with  reducing  rows  and 
columns.  Three  substantially  different  such  algorithms  deserve  consideration. 

(a)  Gauss  vector-scalar  multiply.  This  requires  that,  in  reducing  the  rth  row  , 
successive  operations  on  preceding  rows  must  be  performed  serially  since  a  partial 
result  from  each  row-operation  is  used  as  an  operand  in  the  next  one  (the  reader  is 
assumed  familiar  with  this  procedure).  In  rows  with  lengths  longer  than  the  vector 
functional  unit  length,  this  dependency  can  usually  be  avoided  by  assembly  coding;  it 
is  then  termed  the  GAXPY  kernel.  It  has  the  advantage  of  yielding  the  largest  average 
vector  length  of  any  of  the  following  kernels  and  so  is  a  serious  consideration  when  the 
matrix  size  is  not  significantly  larger  than  the  maximum  allowable  vector  length,  and 
assembly  coding  is  allowed.  This  procedure  does  not  fend  itself  to  partitioning  when 
large  matrices  are  involved,  but  is  a  potentially  useful  subalgorithm  in  such  cases. 

(b)  Matrix-vector  multiply.  Early  experience  with  the  CRAY-1  indicated  that 
block-oriented  algorithm  organization  had  at  least  a  pedagogical  advantage  for  large 
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problems[5][6].  Unfortunately,  the  necessary  {vector  -  (matrix'vector)}  kernel  was  not 
made  a  part  of  the  CRAY  scientific  library  and  consequently  the  kernel  was  not 
syntactically  distinguised  from  the  rest  of  CAL-coded  factorization  algorithms.  The 
organizational  concept  of  basing  factorization  on  matrix-vector  multiply  subroutines 
was  developed  in  [7][8],  where  it  was  illustrated  how  unrolling  Fortran  loops  could  be 
used  to  achieve  a  high  performance  completely  from  a  high-level  language.  This 
emphasized  portability  and  maintainability.  More  recently,  these  kernels  -  known  as 
second-level  BLAS  -  have  been  proposed  as  the  basis  for  other  common  linear 
algebra  algorithms[9]. 

(c)  Matrix-matrix  multiply.  Although  it  has  always  been  clear  that  factorization 
can  be  accomplished  by  a  matrix-matrix  multiply  kernel,  the  CRAY-1  memory 
hierarchy  was  not  sufficiently  distinctive  to  achieve  a  significant  advantage  over  the 
kernels  of  (a)  and  (b)  above  [3].  The  additional  memory  paths  of  the  X-MP  made  this 
even  less  attractive,  and  partially  reduced  the  advantage  of  the  matrix-vector  multiply 
above.  The  disadvantages  of  basing  factorization  on  matrix  multiplies  are  the 
necessity  for  other  matrix-level  kernels  to  perform  reciprocations  and  substitutions  and, 
most  important,  the  difficulty  of  partial  pivoting. 

The  memory  distribution  must  be  quite  distinctive  to  warrant  the  prog.amming  effort  of 
(c).  This  report  documents  this  case  for  the  CRAY-2,  while  laying  the  groundwork  for  a 
multiprocessor  implementation. 


II.  NON-PIVOTING  ALGORITHMS  AND  PERFORMANCE 
A.  M*V-based  Algorithms 
Given  a  set  of  equations* 

AX  a  B 

where  A  is  an  nxn  matrix  and  X  and  B  are  vectors,  the  factorized  solution  proceeds  by 
forming  lower  and  upper  triangular  factors  L  and  U,  viz 

A  a  L  U  (1) 

and  then  solving 

LY  «  B 


Matrices  are  in  bold,  vectors  are  in  upper  case,  and  scalars  are  in  lower  case  type. 


for  X.  The  complexity  of  the  factorization  step  of  Eq.  (1 )  is  O(n^).  The  other  steps  have 
complexity  O(n^),  with  only  one  add-multiply  for  each  L  and  U  element;  consequently 
no  algorithmic  speedup  results  from  transferring  L  and  U  to  local  memory,  so  that 
these  steps,  if  performed  separately  from  the  factorization  as  above,  will  not  be  studied. 

In  the  non-pivoting  algorithms  based  on  a  matrix-vector  multiply  (abbr.  M*V-based),  the 
columns  of  L  and  rows  of  U  are  indicated  as  in  Figure  2a.  Here  the  diagonal  element, 
the  row  to  its  right,  and  the  column  below  it  are  denoted  a22»  Ai2>  ancl  A21 

respectively.  The  steps  to  perform  the  factorization  are  then 


Matrix-vector  multiplies: 


a22 

<— 

a22  * 

a21 

a12 

(2) 

A32 

a32 

a31 

A23 

<— 

a23  ' 

a21 

A13 

(3) 

Reciprocation: 

a22 

<— 

1/a22 

(4) 

Reciprocal  propogation: 

a32 

< — 

a22  a32 

(5) 

B.  M*M-based  Algorithms 

Another  matrix  partition  permits  the  factorization  to  be  performed  on  submatrices 
(Figure  2b}.  The  equations  equivalent  to  (2)  -  (5)  are 


Matrix-matrix  multiply: 


a22 

<-— 

a22 

-  a21 

a12 

(6) 

a32 

a32 

a31 

a23 

< — 

a23 

*  a21 

a13 

(7) 

Factorization: 

a22 

<— 

CM 

04 

_i 

u22 

(8) 

Substitution: 

a32 

< — 

a32 

-1  -1 
u22  l22 

0) 

Alternatively,  Eq.  (9)  can  be  replaced  by  two  substitution  steps 

a32 

<— 

a32 

-1 

u22 

(9a) 

a23 

< — 

-1 

l22 

eo 

CM 

< 

(9b) 

The  advantage  of  Eq.  (9)  with  a  local  memory  will  be  discussed  below. 

The  critical  size  parameter  is  the  dimension  of  the  square  diagonal  block  (nd).  This 

has  been  chosen  to  be  64,  the  maximum  vector  length  of  the  CRAY-2,  and  a  length 
consistent  with  local  memory  size  (see  below). 

C.  Multiply  Kernel  Partitioning  and  Performance 

1.  Block  partitioning 

Because  of  the  restricted  size  of  LM,  it  is  not  feasible  to  load  A32  and  A23  into  local 
memory.  Therefore,  submatrices  are  defined  ,  viz 


(10) 


Ali  =  Aij,11  aIJ,12  Alj,1r 

Al  j,21  Alj,22  Ai|,2r 


I  AIJ,t1  AIJ,t2  Alj,tr 

The  two  dimensions  of  these  component  matrices  -  except  at  boundaries  of  A  -  is 
chosen  to  be  nd;  with  nd  =  64,  up  to  three  such  matices  can  be  stored  in  the  16K  LM, 

with  room  left  for  system  and  temporary  storage. 

The  components  of  Eqs.  (6)  and  (7)  are  computed  by  a  block  dot  product,  viz 

s 

a23,1J  <  *  a23,1J  ‘  £  A21,1k  A13,kJ  01) 

k=1 

The  complexity  of  such  ndxnd  block  multiplies  is  0((n/nd)3)  for  the  entire  factorization 
algorithm. 

2.  Data  transfer  overhead 

The  total  number  of  inter-memory  data  transfers  to  form  Eq.  (11)  is  2(1+s)nd2.  With  a 

clock  period  of  tc,  a  transfer  rate  of  W  words/clock  and  a  floating  point  execution  rate  of 
Rg  for  data  resident  in  LM,  the  multiply  execution  rate  to  produce  A23f|j  above  is 

R0 

R  -  _  (12) 

1  +  1  +  s  tc  Rq 

s  Wnjj 

With  R0  *  430  MFLOPS,  W  *  .8  (using  CAL-coded  transfer  routines),  and  tc  ■  4.1  nsec, 
then  402  <  R  <  415  MFLOPS  for  1  <  s  <  *o. 

3.  Conflict  sensitivity 

The  presence  of  bank  conflicts  affects  the  transfer  rate  W.  It  is  therefore  informative  to 
compute  a  normalized  fractional  sensitivity 


S  =  (dR/R)/(dW/W) 


=  1  (13) 

1  +  _S_  JfiLOd 
1+s  tc  Rq 

For  the  above  values,  .064  >  S  >  .033  for  1  <  s  <  Thus,  with  dW/W  =  -.5  (an 
average  delay  of  50%,  not  necessarily  representative),  reductions  in  R  of 
approximately  12  MFLOPS  are  predicted.  The  block  multiply  rate  is  now  390  <  R  < 

403  MFLOPS  for  1  <  s  <  «>.  This  execution  rate  is  far  greater  than  might  be  expected 
from  a  multiply  code  executing  from  CM  with  the  same  interference. 

D.  Block  Factorization  and  Substitution 

Both  of  the  substitution  steps  of  Eq.  (9)  can  be  performed  with  a  sinlge  matrix  load  of 
LM  provided  that  A32  is  partitioned  as  above,  e.g., 


-1  -1 

A32,i1  <—  a32,I1  u22  l22 


(14) 


The  execution  rates  of  CAL-coded  LU  factorization  and  substitution  steps  of  Eqs.  (8) 
and  (14)  operating  from  local  memory  with  n^  =  64  have  been  measured  as  124  and 

200  MFLOPS,  respectively,  including  the  effects  of  memory  transfers  and  bank  conflicts 
during  a  daytime  load.. 

E.  Overall  Performance 

Figure  3  gives  the  performance  of  a  CAL-coded  M*M-based  factorization  utilizing  LM  in 
comparison  with  Fortran-  and  CAL-coded  M*V-based  algorithms  executing  from  CM. 
(Recall  that  LM  M*V  kernels  are  inefficient  and  so  are  not  studied.)  These  were  run 
during  a  daytime  load  at  MFECC,  using  the  CIVIC  Fortran  compiler  circa  November  14, 
1985.  The  performance  of  a  standard  Gauss  column-based  Fortran  code  is  also 
shown. 

The  Fortran  Gauss  algorithm  (Appendix  A)  does  not  permit  unrolling  or  other 
techniques  for  overcoming  the  hardware  disadvantages  of  no  chaining  and  a  long  CM 
path;  the  result  is  that  no  overlapping  can  be  achieved  between  the  three  vector 
memory  accesses,  the  vector  multiply,  or  the  vector  add  that  characterize  the  inner 
loop.  A  rate  of  60  MFLOPS  for  n=2048  is  the  result.  This  is  likely  the  asymptotic  rate 
of  any  Fortran  factorization  not  utilizating  loop  unrolling  or  LM  (see  Appendix  B). 


Figure  3.  Performance  of  non-pivoting  factorization 


The  Fortran  M*V-based  algorithm  performances  for  n=2048  show  a  speedup  of  2.96:1 
over  Gauss,  principally  due  to  a  16-way  unrolling  of  the  matrix-vector  multiply.  (The 
unrolling  issue  is  studied  in  Appendix  B  in  detail).  Assembly  coding  yields  another 
1.62:1  speedup  over  Fortran,  due  in  part  to  reduction  of  scalar  operations  associated 
with  unrolling  and  to  the  resistance  to  bank  conflicts  which  can  be  achieved  in  CAL  by 
pre-fetching  vector  operands. 

The  LM  M*M-based  factorization  ranges  in  performance  between  124  MFLOPS  for 
n=64  -  when  only  the  factorization  of  Eq.  (8)  must  be  performed  -  to  nearly  400 
MFLOPS  when  n*2048  and  Eq.  (11)  is  dominant.  The  200-MFLOP  performance  of  the 
substitution  step  of  Eq.  (8)  maintains  a  relatively  high  performance  vis-a-vis  the  other 
implementations  for  intermediate  values  of  n. 


III.  PIVOTING  ALGORITHMS  AND  PERFORMANCE 
A.  Influence  of  Pivoting 

On  a  vector  machine  such  as  the  CRAY-2,  partial  column  pivoting  has  two 
components:  (1)  the  search  for  the  maximum  element  of  a  column,  and  (2)  exchange  of 
two  complete  rows  of  the  matrix.  The  latter  is  usually  preferred  over  maintenance  of  an 
index  pointer  in  order  to  avoid  relatively  slow  indirect  addressing.  These  two  functions 
are  denoted 

a  <—  piv  { s.  V } 

where  a  is  the  element  of  maximum  absolute  value  of  scalar  s  and  the  elements  of 
vector  V. 

In  M*V-based  factorization  this  search  is  routinely  performed  after  Eq.  (2)  or  (3)  by  the 
step 


a22  <—  piv  { a22>  A32 }  (3a) 

However,  in  the  M*M-based  version,  the  granularity  of  the  algorithm  does  not 
recognize  individual  matrix  elements  and  columns.  The  problem  then  becomes  to 
preserve  the  high  performance. of  the  matfo-malrix  multiply  performing  the  majority 

of  computations  at  the  block-level,  vet  to  occasionally  expose  individual  columns  to 
permit  pivoting.  The  solution  is  the  following  2-level  algorithm  (Figure  14). 

Equations  (6)  and  (7)  are  carried  out,  viz 
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Matrix-matrix  multiply: 


a22 

<— 

a22 

- 

a21 

a12 

(15) 

a32 

a32 

a31 

a23 

<— 

a23 

- 

a21 

a13 

(16) 

as  an  O^n/n^)  process.  The  columns  of  the  resulting  Ax  =  [  A22  A32  ]"*" 

block-column  matrix  are  at  this  point  partially  reduced,  with  all  the  accumulations  from 
the  columns  of  A31  performed  but  without  contributions  from  the  internal  columns  of 

Ax.  Ax  is  then  reduced  using  either  a  Gauss  column  reduction  or  the  M*V  method  of 
Eqs.  (2)-(5),  viz  (an  underline  represents  components  of  this  second  reduction  level) 


322 

<— 

< 

1 

CM 

CM 

«* 

A12 

(17) 

^32 

^32  1  ^31  I 

A23 

<— 

A23  *  ^21 

A.13 

(18) 

322 

<— 

piv  { 322  •  ^32  ) 

(19) 

322 

<— 

1/322 

(20) 

^32 

<— 

322  A32 

(21) 

The  computations  of  Eq.  (17-21)  are  performed  from  CM  and  will  so  be  slowed  by 
memory  access  delays. 

Exclusive  of  the  pivoting  of  Eq.  (19),  the  result  of  these  level-2  steps  is  the  equivalent  of 
factoring  A22  in  Eq.  (8)  and  performing  A32  (U22  )“^  in  Eq.  (9a).  The  substitution 
with  L22  may  then  be  performed  either  on  the  resultant  A32  as  in  Eq.  (9)  or  on  A23 

as  in  Eq.  (9b).  In  either  case,  this  can  be  carried  out  in  local  memory  at  a  speed 
somewhat  less  than  the  200  MFLOPS  noted  above. 


With  nd  fixed,  the  complexity  of  level  2  is  readily  shown  to  be  0(n2),  whereas  the  M*M 
kernel  complexity  remains  0((n/n(j)3).  For  large  n,  the  execution  rate  should  therefore 
approach  that  of  the  multiply  kernel  or  approximately  400  MFLOPS. 

B.  Implementation  and  Performance 

Figure  5  presents  the  results  of  the  same  algorithms  as  Figure  3  but  with  pivoting.  The 
rise  of  the  M*M  algorithm  to  the  asymptotic  rate  is  now  slower.  Several  explanations 
are  offered  for  this  performance. 

(a)  M*M  vs  M*V  CAL  performance.  Without  pivoting,  the  advantage  of 
M*M-based  factorization  was  maintained  for  all  n;  in  Figure  5  this  occurs  only  for  large 
n.  This  is  explained  in  part  by  the  complete  CAL-coding  of  the  diagonal  block 
M*M-based  factorization  of  Eq.  (8)  without  pivoting  (Figure  3),  and  the  mixed  Fortran- 
and  CAL-coding  of  this  step  in  both  M*V-based  factorization  -  with  and  without  pivoting 
-  and  in  level  2  of  the  M*M-based  pivoting  algorithm.  With  a  mixed  coding,  all 
matrix-vector  multiplies  and  searches  for  the  maximum  element  of  a  column  are 
performed  in  CAL,  but  the  subroutine  linkage  to  these  routines  introduces  an 
overhead  with  O(n)  complexity.  Thus,  comparisons  in  Figure  3  for  small  n  include  the 
effects  of  different  codings,  whereas  those  of  Figure  5  do  not. 

(b)  Effect  of  2-level  algorithm.  The  CRAY-2  implementation  the  piv{  s,  V }  function 
of  Eq.  (3a)  requires  a  fixed  overhead  dependent  only  on  the  length  of  V  and 
independent  of  the  matrix  element  values.  Consequently,  it  is  possible  to  delineate 
between  the  pivoting  speedown  due  to  piv{  s,  V }  and  that  due  to  the  2-level  nature  of 
the  algorithm.  These  are  presented  in  Figure  6  for  the  M*M  algorithm.  For  n  >  256  - 
where  the  effect  of  the  coding  differences  of  (a)  is  largely  dissipated  -  the  larger 
degradation  is  the  result  of  the  piv  { s,  V  }  function.  Since  the  latter  cannot  be  avoided, 
the  algorithmic  speedown  from  the  introduction  of  a  second  level  does  not  appear 
significant. 


IV.  CONCLUSIONS 

In  general,  the  partitioning  of  an  algorithm  into  larger  computational  tasks  favors  a 
parallel  implementation  since  fewer  task  startups  are  involved.  However,  in  a  CRAY-2 
system  dedicated  to  an  equation  solution,  equalizing  the  workload  among  the 
processors  (load-leveling)  also  becomes  an  issue;  this  favors  smaller  tasks  associated 
with  M*V-based  factorization.  These  issues  will  be  investigated  in  a  companion 
report. 
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APPENDIX  A 


FORTRAN  PROGRAM  LISTINGS 


C  *  *■  •  *  *  Gr«US5  FctCTORI Z«TI  ON 

IMPLICIT  REhUh  -  H.O  -  Z' 

>  DIMENSION  TEMP<2048),  A(2049,2048> 

NOIM  -  2049 

>  CALLLINK<“UNIT6»TERMINAL//" > 

>  CALLLINK< "UNIT5“TERMINAL//" ) 

>C****  READ  MATRIX  SIZEDL 

>  10  WRITE  <5,20) 

>  20  FORMAT  ('  ENTER  MATRIX  SI2E') 

>  READ  <5,30)  N 

>  30  FORMAT  05) 

FORMULATE  DIAGONALLY -DOM INANT  MATRIX 
>■  DO  50  I  -  1  ,  N 

>  DO  40  J  =  1 ,  N 

40  A<I,J>  *  -N  ♦  IABS< I  -  J) 

>  50  A( I , I )  ■  l.l  *  <<N  -  1)*N  -  <1  -  1 ) *1/2  -  <N  -  I  ♦  1 ) »<N  -  t )/2) 

>  T1  -  SECOND<  0 ) 

>  CALL  FAC<N,  NDIM,  A,  TEMP) 

>  T1  *  SECOND <  0 )  -  T 1 
AN  -  N 

>  AN  2  *  AN  »  AN 

>  AN 3  ■  AN 2  •  AN 

>  OP  -  < <2.*AN3)/3. >  -  < AN2/2 . )  ♦  <5.*AN/6.) 
flops  ■  OP  /  Tl 

>  WRITE  (5,40)  A<N,N> 

>  WRITE  (5,60)  Tl,  OP,  FLOPS 

>  60  FORMAT  (3E14.6) 

>  00  TO  10 

>  END 

>  SUBROUTINE  FAC<N,  NP1 ,  A,  TEMP) 

’PLI CIT  REAL < A  -  H,0  -  2) 

DIMENSION  A<NP1 ,1 ) ,  TEMP<1 > 

>  NM1  -  N  -  1 

>  IF  (NM1  -.EQ.  0)  GO  TO  80 

)  DO  70  J  -  1 ,  NM1 

>  10  NMJPI  »  N  -  J  ♦  1 

>  JJ  -  I SAMAX<NMJP1 ,A( J , J) ,1)  ♦  J  -  l 

>  T  »  ABS( A<  JJ , J) ) 

>  IF  <T  . EQ.  0)  WRITE  <6,20) 

>  20  FORMAT  ('  ZERO  PIVOT') 

>  DO  30  I  -  1  ,  N 

>  30  TEMPO)  -  A<  J ,  I  ) 

>CDIR«  IVDEP 

>  DO  40  I  ■  I,  N 

>  40  A<  J , I )  •  A( JJ , I ) 

>  DO  50  I  -  1 ,  N 

>  50  A<  J  J ,  I  )  ■  TEMPO) 

>C»«*  RECIPROCATE  DIAGONAL 

>  A<  J , J)  ■  1.  /  A<  J , J) 

>  ALPHA  »  A( J, J) 

>  IJ  »  J  ♦  1 

>C*»*  PROPOGATE  RECIPROCATED  DIAGONAL  DOWN  COLUMN 

>  DO  60  I  »  IJ,  N 

>  60  AO  ,  J)  ■  ALPHA  •  Ad  ,  J) 

>C*»*  MULTIPLY  COL.  BY  ELEMENT  fc  SUBTRACT  FROM  ANOTHER  COL. 

)  00  70  I  »  IJ,  N 

>  ALPHA  -  A< I ,J) 

/COIR*  IVOEP 

/  >  DO  70  K  ■  IJ,  N 

>  70  AO  ,K>  -  AO,K>  -  ALPHA  •  A<J,K> 

>C»*«  RECIPROCATE  LAST  DIAGONAL 

>  80  A<N,N)  «  1 .  /  A<N,N) 

>  RETURN 

>  FNO 


Tabip  A-l.  Gauss  factorization 

lo 


C****  FACTORIZATION  USING  14-U-  UNROLLED  MwTPI  \ -VECTOR  MULTIPLIES 
IMPLICIT  REALIA  -  H,Q  -  Z> 

DIMENSION  TEMPI  1024),  A! 1025, 1024) 

CALLLINK!  “UNITS  -  TERMINAL  //  "> 

>  CALLLINK!  “UNIT4  -  TERMINAL  //  *) 

NDIM  ■  102S 

>  10  WRITE  '.5,20) 

20  FORMAT  ('  ENTER  MATRIX  SIZE) 

READ  (.5,30)  N 
30  FORMAT  15) 

>C**»  FORMULATE  DIAGONALLY -DOMINANT  MATRIX 
)  DO  50  I  -  1 .  N 

>  DO  40  J  »  1  ,  N 

40  AlI.J)  *  -N  ♦  IA8SII  -  J) 

50  A-.  1,1)  -  l.l  *  <  •  N  -  1  )  #N  -  (I  -  l)*I/2  -  IN  -  I  ♦  1  )  *<N  -  I  .  -  2 1 
*0  T2  »  SECOND! 0) 

>  DO  1 40  J  =  l ,  N 

IF  (J  .EQ.  1)  GO  TO  70 

CALL  SMXPYIAI 1 , J) ,  A(J,J),  A!J,1>,  N  -  J  ♦  1,  J  -  1,  NDIM) 

>C**  SEARCH  FOR  PIVOT 

>C2  CONTINUE 

-■C  FORM  JTH  ROW  OF  U 

70  NMJP1  *  N  -  J  ♦  1 

>  JJ  «  ISAMAXINMJP1 ,A< J, J) ,1 )  ♦  J  -  1 

>  T  *  ABS! A( JJ , J) ) 


IF  (T  .EQ 

.  0)  WRITE  (4 

>  80 

FORMAT  ( 

ZERO  PIVOT  > 

> 

DO  90  I  - 

1,  N 

>  90 

TEMPI  I  )  =* 

A<  J ,  I  > 

>CDIR* 

IVDEP 

DO  100  I  »  1,  N 

>  100  A<  J , I )  ■  A<  J J , I ) 

>  DO  1 1 0  I  *  1  ,  N 

>110  A<. JJ,I )  ■  TEMPI  I ) 

>  A< J, J)  -  1 .  /  A<  J. J) 

>  IF  <J  ,EQ.  N)  GO  TO  150 

>  IF  <J  .EQ.  1)  GO  TO  120 

>  CALL  SXMP  < IAI  J , 1 ) ,  A!  J ,  J  *  1 ) ,  A< 1 , J  ♦  1 ) ,  N  -  J ,  J  -  l ,  NDIM, 

>  1  NDIM) 

>  120  T  *  A<  J , J) 

>  JP1  ■  J  ♦  1 

>  DO  130  1  -  JPl ,  N 

>  130  A(J,I)  «  T  •  A< J , I ) 

>  140  CONTINUE 

>  130  T1  ■  SECOND! 0)  -  T2 

>  AN  -  N 

>  AN 2  -  AN  *  AN 

>  aN3  ■  AN 2  •  AN 

>  OP  -  U2.«AN3>/3.)  -  !«4 2/2.)  ♦  <3.«AN/4.) 

>  FLOPS  -  OP  /  T1 

>  WRITE  (S, 140)  Tl,  OP,  FLOPS 

>  140  FORMAT  OE13.4) 

>  00  TO  10 

>  END 


Tablp  A-2.  Matrix-vector  multiply-based  factorization 


i>***  TRANSPOSED  TROLLED  MATRIX -‘.-ECTOR  M'.LTI  PL  r 

>  SUBROUTINE  SXMPY ( X ,  Y,  M,  Nl  .  N2.  NDCM.  NO) i  * 

>  REAL  X ( NOXY , 1  > ,  Y (NOx i  ,  1  )  ,  M<NOCM. 1 ) 

;  J  »  M0D<N2,2> 

IF  (J  .LT.  1>  GO  TO  20 

>  DO  10  I  >  1,  NI 

>  10  i'.  1. 1)  «  <  Y(  1  , 1  )  >  -  X(1,J)  *  M<J,I' 

>  20  J  *  M0D<N2,4) 

>  IF  (J  .LT.  2)  GO  TO  40 

;  DO  30  I  »  1 ,  Nt 

V  30  YU, I)  *  <.»YU,I))  -  XU,J  -  1>#M<J  -  1,1))  -  X(1,J>  *  MtJ.I) 

/  40  J  «  MOD(N2 ,8) 

>  IF  (J  .LT.  4)  GO  TO  SO 
DO  30  I  *  1  .  ill 

>  50  YU, I)  *  UUYU.I))  -  X<1,J  -  3)#M(J  -  3 , 1  > )  -  XU,J  -  2>*M<J  - 

'  12,1))  -  XU,J  -  1  ;  *M  <  J  -  1,D)  -  XU,J>  *  M<J,I) 

)  *0  J  *  MOD<N2 , 1 d) 

>  IF  (J  .LT.  3)  GO  TO  30 
DO  70  1  *  1 ,  Nl 

>  ~.j  ,U,I)  «  U  v  U  U  <Y(  1  ,1 )  )  -  XU,J  -  7)  *M(  J  -  7,1))  -  XU,J  -  6>*Mv 

1J  -  6, 1))  -  XU,J  -  5)  #M<  J  -  5,1))  -  XU,J  -  4)  *M<  J  -  4,I>)  -  XU, 
^  2J  -  3)  »M<  J  -  3,D)  -  XU,J  -  2)  *M<  J  -  2 , 1  >  )  -  XU,J  -  1>*M<J  -  1, 

>  31  )>  -  XU  ,  J)  *  M<  J,  I  ) 

>  30  JMIN  -  J  ♦  16 

>  IF  -'JMIN  .GT.  N2>  GO  TO  100 

>  DO  100  J  «  JMIN,  N2,  16 

>  DO  90  I  -  1 ,  Nl 

/  90  Y(l, I)  »  <<<<<<<<<<<<<<<<Y<1  ,J>>  -  XU,J  -  1 3)  *M< J  -  13,1))  -  XC 

>  1  1,J  -  1 4)  *M<  J  -  14,1))  -  X(1,J  -  1  3)  *MC J  -  13, I>)  -  X(1,J  -  12)* 

>  2  M(J  -  12,1))  -  X<1,J  -  1  1  >*M<  J  -  11, I>)  -  X(1,J  -  10)  *M J  -  10. 

>  3  I))  -  XU,J  -  9)  #M<  J  -  9,1))  -  XU,J  -  8)  *M(  J  -  8,1))  -  XC1.J  - 

>  4  7)*M<  J  -  7,1))  -  X<1,J  -  6>*M<  J  -  6,1))  -  XU,J  -  5>«M<J  -  3,1;) 

>  3  -XU,J  -  4)  *M<  J  -  4,1))  -  X<t,J  -  3>*M<J  -  3 , 1  >  )  -  X(1,J  -  2)*M( 

>  6  J  -  2,D)  -  XU,J  -  1  )  *M<  J  -  l, I))  -  X(1,J)  *  M<  J  ,  I  ) 

>  100  CONTINUE 

>  RETURN 

>  END 

>C»***  UNROLLED  MATRIX-VECTOR  MULTIPLY 

>  SUBROUTINE  3MXPY<X,  Y,  M,  Nl ,  N2 ,  NDIM) 

REAL  X<1),  Y<1),  M<NDIM , 1 ) 

>  J  »  MOO<N2,2> 

>  IF  <J  .LT.  1)  GO  TO  20 

>  DO  10  I  -  l ,  NI 

>  10  YU)  »  (YU))  -  X<J)  *  MU  , J) 

>  20  J  -  M0D<N2,4> 

>  IF  <J  .LT.  2)  GO  TO  40 

>  DO  30  I  -  1,  Nl 

>  30  YCI)  -  <  <Y<  I ) )  -  X<J  -  1)*MU,J  -  1)>  -  X<J)  *  MU  ,  J) 

>  40  J  »  M00<N2,8> 

>  IF  <J  .LT.  4)  GO  TO  60 

>  DO  50  I  -  1  ,  Nl 

>  30  YU)  -  <<<<YU>>  -  X<J  -  3>«MU,J  -  3))  -  X<J  -  2>*MU,J  -  2)>  - 

>  1X(J  -  1 )*M< 1 , J  -  1))  -  X<J)  •  M< I , J) 

>  60  J  •  M0D(N2 , 16) 

)  IF  <J  .LT.  8)  GO  TO  80 

>  DO  70  I  -  1,  Nl 


Table  A-2.  Continued 


20 


V  J 


70  f  •  I  •  =  -  a<.  J  -  7 «  *H‘  I  ,  J  -  7  ■  .<  - 

1  -  X(J  -  5  >  < I  ,  J  -  5')  -  A<J  -  4  >  *M  (  I  ,  J  -  4-' 

23))  -  a'.  J  -  2>*M<I.J  -  2))  -  X<J  -  1  )  *M  <  I  ,  J  - 


1  )  > 


-  I  .  J  - 

•  j  -  3  .•  *ti‘.  i ,  j  - 

-  x<  j>  #  M(  i .  J) 


SU  JMIN  =»  J  +  16 

IF  (JMIN  .GT.  N2)  GO  TO  100 
DO  100  J  =  JMIN,  N2,  16 
P>:  ; )  I  *  1  ,  N1 

?0  Y*I)  *  <<<<<<<(<<<<<<<<Y<I>>  -  X(J  -  1  5 )  *M  <  I , J  -  15))  -  X(J  - 
1  14)  *M<  I  ,  J  -  14))  -  X<J  -  1 3)  *M(  I  ,  J  -  13))  -  X(J  -  12)*M<I,J  - 


2 

12))  - 

X<  J  -  1 1 )#M< I  , J  -  1 1 >  > 

-  XC J  -  10)#M<I . J  - 

10)  >  - 

X(J  - 

3 

9 ) *M( I 

,  J  -  9)  >  -  X<  J  -  8)  *M<  I 

,  J  -  8)  )  -  X(  J  -  7  >  «M ( I  ,  J  - 

7)  )  - 

4 

X(J  - 

4) *M< I , J  -  4) )  -  X< J  - 

5) *M( I , J  -  5) )  -  X< J 

-  4)«M(1 , J  - 

5 

4)  )  - 

X<  J  -  3)  *M<  I  ,  J  -  3)  )  - 

X<J  -  2)»M(I ,J  -  2)) 

-  X(J 

-  1  )  *M  ( 

6 

I.J  - 

1  ) )  -  :<(  J)  *  M(  I  ,  J) 

100  CONTINUE 
RETURN 
END 


Table  A-2.  Continued 


Copy  available  to  DT1C  does  not 
permit  fully  legible  reproduction 


*1 


>c* 

>c» 

> 

> 

> 

> 

> 

> 

> 

> 

> 

> 

> 


.  CIVIC2  BLOCKED  FACT  OR 1 2 AT 1  ON  &  SOLUTION  WITH  PIVOTING 
THIS  VERSION  HAS  ASSEMBLY-COOED  SLM3D.LLM3D 
MXMPMA , MXVPVA , SUSP I V .  AND  I SAMAX  ROUTINES 
IMPLICIT  REAL < A  -  H.O  -  2) 

INTEGER  QDIM 

DIMENSION  B<  2048)  ,  AS<1,1>,  TEMP<2048>,  A(2049,2048> 
DIMENSION  TIME< 4) 

REGFILELM1.  LM2,  LM3 
COMMON  /LM1/  AT (44,44) 

COMMON  /LM2/  Q-.o4.o4) 

COMMON  /LM3/  SUB <64. 64) 

CALLLI  NK<"UNIT6-TEMRINAL  //  *  > 

CALLLI  NKCUNIT5-TERMINAL  //  "» 

NDIM  »  2049 
QDIM  »  64 


>  QDIM  »  64 

READ  MATRIX  SI2E. BLOCK  SIZE 

>  10  WRITE  (3.20) 

>  20  FORMAT  <'  ENTER  N.  NSIZE.') 

>  READ  (3,30)  N.  NSIZE 

>  30  FORMAT  <2I5> 

FORMULATE  DIAGONALLY -DOM INANT  MATRIX 
>C«  AND  RHS  WITH  SOLUTION  B<J)-J+1 

>  DO  50  I  -  1 ,  N 

>  B( 1 )  *  0. 

>  DO  40  J  -  1  ,  N 

>  A< I . J)  ■  -N  ♦  IA8S( I  -  J) 

>  40  IF  <1  .NE.  J)  B< I >  -  B< I )  ♦  A< I , J)  *  <J  ♦ 

>  A< I , I )  »  1.1  *  <<N  -  1  )*N  -  < I  -  1 > #1/2  - 

>  30  B< I )  -  B(I)  ♦  A<I,I)  *  <1  ♦  1> 

>  60  NM1  -  N  -  1 

>  DO  70  LL  -  1,  4 

>  70  TIME<  LL)  -  0. 

MAIN  FACTORIZATION  ALGORITHM 


1 >*<N  -  I >/2> 


>  C 

> 

> 

> 

> 

> 

> 

> 

> 

>C»**» 
>  80 

> 

> 

> 

> 

> 

> 

> 

> 

>c***« 


DO  120  LL  «  1 .  N.  NSIZE 
TIM1  -  SECOND< 0 ) 

LLNS  *  MIN0<  LL  ♦  NSIZE 
LLM1  -  LL  -  1 
NDIAG  «  LLNS  -  LL  ♦  1 
CALL  ALUPI V< A ,  B,  TEMP, 
TIM2  -  SEC0ND< 0) 

TIM3  -  TIM2 
TIM4  •  TIM3 
>  FORM  (A23»L22«*-1 ) 
NMLLNS  -  N  -  LLNS 
IF  (NMLLNS  .EQ.  0>  GO  T 
■  LOAD  LM  WITH  L  1  U  AS  T 


LLNS,  NDIM,  N> 


GO  TO  110 

AS  TUO  SEPARATE  ARRAYS  REQUIRED  BY  SUBPIV 


CALL  LLM3D(A<LL,LL> ,  1,  NDIAG,  NDIAG, 

CALL  LLM3D<A<LL,LL> ,  2,  NDIAG,  NDIAG, 

JX  -  LLNS  ♦  1 

LK  •  LLNS  -  LL 

LKPl  -  LK  ♦  1 

DO  100  J  •  JX,  N,  NSIZE 

NMJ  ■  MIN0(N  -  J  ♦  l, NSIZE) 

LIAO  BLOCK  OF  A23  FROM  LM  INTO  LM 


1 ,  NDIM. 
NDIM,  1, 


QDIM, 

QDIM, 
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•  ■  h  *  *  *  •  *  •  *  *  4  L  '  * 

S  V  * _•  > 


'  CALL  LLM3D<At J.LL) .  3.  NM  J ,  LKPl .  1,1.  NDIM.  1.  QDIM.  n 

SUBSTITUTE  INTO  BLOCK 

>  CALL  SUBPIV<NMJ,  LK.  QDIM) 

STORE  BLOCK  FROM  LM  INTO  CM 

>  90  CALL  SLM3D<A< J ,LL)  .  3,  NMJ.  LKPl,  1,  1.  NDIM.  I,  QDIM,  1> 

>  100  CONTINUE 

>  TIM3  -  SEC0ND<0> 

INNER  LOOP  BLOCK  MULTIPLIES 

>  NMJX  a  N  -  JX  ♦  1 

>  LLNS1  -  MIN0<LLNS  ♦  NSI2E.N) 

>  CALL  MJ<MPMA<A<  JX,1)  ,  1,  NDIM,  AU.JXJ,  1,  NDIM,  A<JX,JX>,  l, 

>  1  NDIM,  NMJX,  LLNS.  LLNS1  -  LLNS,  -1 ) 

>  LLNS2  a  N  -  LLNS1 

>  IF  < LLNS2  .EQ.  0)  GO  TO  110 

>  CALL  MJ<MPMA<A<  JX,1>  ,  1.  NDIM,  AU.LLNSl  ♦  1),  1,  NDIM, 

>  1  A< JX.LLNS1  ♦  1),  1,  NDIM,  LLNS1  -  LLNS.  LLNS,  LLNS2,  -1J 

>  110  TIM4  -  SECOND<0) 

>  T I ME < 1 )  »  TIME< 1 )  -  TIM1  ♦  TIM2 

>  TIME<  2)  -  TIME<  2)  -  TIM2  ♦  TIMS 

>  T I ME < 3 )  »  TIME< 3)  -  TIMS  ♦  TIM4 

>  120  CONTINUE 

>  TIMS  -  SECOND<0) 

>C»***  FACTORIZATION  ENDED:  FORWARD  SUBSTITUTION 

>  DO  130  LL  ■  1 ,  N,  NSIZE 

>  LSAVE  a  LL 

>  IF  <LL  .EQ.  1)  GO  TO  130 

>  LLNS  a  MIN0<LL  ♦  NSIZE  -  1 ,N>  -  LL  ♦  1 

>  CALL  MJ<VPUA<A<  LL ,  1 )  ,  1.  NDIM,  B.  I,  B<LL)  ,  1,  LLNS,  LL  -  1.  -1J 

>  130  CONTINUE 

-,;•***  BACK  SUBSTITUTION  HAS  TWO  STEPS 

>C*  1ST  STEP:  M*B 

>  LZ  »  LSAVE 

>  DO  210  LM  «  1 ,  N,  NSIZE 

>  140  IF  <LM  .EQ.  1)  GO  TO  130 

>  LL  »  LZ  ♦  NSIZE 

>  NQ  a  n  -  LL  ♦  1 

>  CALL  MXVPVA<A<LZ,LL),  1,  NDIM,  B<LL) ,  1,  B<LZ).  1,  NSIZE,  NQ, 

>  1  -1) 

>C*  2ND  STEP:  <U**-1  L«*-l  B) 

>  ISO  LLNS  »  MINO <N , LZ  ♦  NSIZE  -  1J 

J  LLNSMZ  »  LLNS  -  LZ 

>  IF  < LLNSMZ  .LE.  OJ  GO  TO  170 

>  LLNSM1  a  LLNS  -  1 

>  DO  140  J  a  LZ,  LLNSM1 

>  JP1  -  J  ♦  l 

>  DO  140  K  ■  JP1 ,  LLNS 

>  140  B< K J  a  8<K>  -  A<K, J>  •  B<J) 

>  170  B<LLNS>  -  B<LLNS>  »  A< LLNS, LLNS) 

>  IF  <LLN9MZ  .LE.  0)  GO  TO  200 

J  DO  190  J  -  LZ,  LLNSM1 

>  JJ  -  LZ  ♦  LLNS  -  J 

>  JJM1  a  jj  -  i 

>  DO  180  K  a  LZ,  JJM1 

>  180  B<K)  ■  8<KJ  -  A<K, JJ)  «  B<JJJ 

>  190  B< JJM1 J  -  B< JJM1 )  *  A< JJM1 , JJM1 > 

>  200  LZ  -  LZ  -  NSIZE 
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>  210  CONTINUE 

>  TIME<4)  -  SEC0ND( 0 )  -  TIM5 

COMPUTE  PERFORMANCE 

>  CALL  RESULTCTIME ,  N.  NSIZE) 

CHECK  SOLUTION 

>  DO  22tf" JK  ■  1 ,  N 

>  BJK  «  JK  ♦  1 

>  IF  <ABS<8< JK)  -  BJK)  .OT.  1 .0-4)  GO  TO  230 

>  220  CONTINUE 

>  WRITE  <3, 230) 

>  GO  TO  260 

>  230  WRITE  <6,240)  N.  NSI2E,  <B< JK) , JK-1 ,N> 

>  240  FORMAT  <214.  4< 1  PEI  6. 8)/< 4< 1  PEI 6.8)  ) > 

>  STOP  2 

>  250  FORMAT  <'  OK') 

>  260  CONTINUE 

>  GO  TO  10 

>  END 

>  SUBROUTINE  ALUPIV<A,  Br  TEMP,  LL.  LLNS,  NDIM ,  N) 

>  DIMENSION  TEMP< 1 ) ,  A<NDIM,1>,  B< l > 

>  DO  90  K  ■  LL,  LLNS 

>  KP1  •  K  ♦  1 

>  JK  •  LLNS  -  K 

>  KMLL  -  K  -  LL 

>  NMK  ■  N  -  K 

>  MiKPl  -  NMK  ♦  1 

>  IF  < KMLL  .EQ.  0)  GO  TO  10 

>  CALL  MXVPVA<A<K,LL) ,  1,  NDIM,  A<LL,K) ,  1.  A<K,K>,  1,  NMKP1 , 

>  1  KMLL,  -1) 

>10  KK  »  ISAMAX<NMKP1 ,A<K,K),1)  ♦  K  -  1 

>  T  -  ABS<A<KK<K  >) 

>  IF  <T  . EO.  0)  WRITE  <6,20) 

>  20  FORMAT  <'  ZERO  PIVOT') 

>  DO  30  I  •  1,  N 

>  30  TEMPO)  -  A<  K ,  I  ) 

>CDIR*  IUDEP 

>  DO  40  I  -  1,  N 

>  40  A<  K , I )  -  A<KK, I ) 

>  DO  30  I  -  1 ,  N 

>  30  A<KK, I )  -  TEMP< I ) 

>  TEM  »  8<K) 

>  B<K)  -  B< KK) 

>  B< KK)  -  TEM 

>  A<K,K)  ■  1.  /  A<K,K) 

>  IF  <KMLL  .EQ.  0  .OR.  JK  ,EQ.  0)  GO  TO  70 

>  60  CALL  MXVPUA<A<LL,KP1>,  NDIM,  1,  A<K,LL> ,  NDIM.  A<K,KPI),  NDIM, 

>  1  JK,  KMLL,  -1) 

>  70  IF  <r#1K  .EQ.  0)  GO  TO  90 

>  DO  80  ILI  -  l,  KTIK 

>  80  A<K  ♦  ILI ,K)  •  T  *  A<K  ♦  ILI,K> 

>  90  CONTINUE 

>  RETURN 

>  END 

>C»»«*  THIS  ROUTINE  COMPUTES  PERFORMANCE  OF  COMPONENTS  OF 
>C»  FACTORIZATION  AND  OF  SOLUTION 

>  SUBROUTINE  RESULT<T1ME,  N,  NSIZE) 
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>  DIMENSION  TiMEU>.  FL0P<4),  0P<4> 

>  DO  1 0  J  -  1 .  4 

>  10  OP<  J)  *  0. 

>  DO  20  J  -  I .  N,  NS1ZE 

>  AN  ■  MININ  -  J  ♦  l.NSIZE) 

>  AN 2  ■  AN  *  AN 

>  aN3  ■  AN  *  Ms|2 

>  OPU  )  »  OP< 1 >  ♦  <<2.*FW3)/3.  >  -  <AN2/2.)  ♦  <3.*AN/4.)  ♦  RES  * 

>  i  AN2 
A 

>  RES  -  MAX0I0 ,N  -  J  ♦  1  -  NSIZE) 

>  OP<2)  ■  OP< 2)  ♦  RES  *  <  AN2  -  AN) 

>  20  OP< 3)  »  0P<3>  ♦  2.  •  RES  «  AN  *  RES 

>  AN  -  N 

>  OP< 4)  -  (2. *AN*AN)  -  AN 

>  DO  30  J  -  1 ,  4 

>  FLOP! J)  »  0. 

>  30  IF  <TIME< J>  .NE.  0.)  FLOP(J)  -  OP<J>  /  TIME(J) 

>  WRITE  <5,40)  <T1ME< J) ,OP< J> ,FLOP( J) , J*1 .4) 

>  40  FORMAT  <3<IPE12.4>) 

>C****  TOTAL  FACTORIZATION  PERFORMANCE 

>  TIME<  1 )  ■  TIMEU)  ♦  TIMEC2)  ♦  TIMEC3) 

>  0P<1>  *  0P<1>  ♦  0P<2>  ♦  OP<  3) 

>  FLOP!  1 )  »  0P<1>  /  TIMEU) 

>  WRITE  <3, 40)  TIMEU),  OP<  1 )  ,  FLOP!  i  ) 

>  RETURN 

>  END 
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COMMENTS  ON  CHARACTERISTICS 


% 


The  performance  of  Table  B-1  was  obtained  between  5-8am  on  12/10/85  at  MFECC. 
The  execution  rates  were  obtained  from  averaging  rates  of  100*1200  runs  of  each 
code.  As  much  as  a  20%  variation  in  average  rates  was  noted  in  the  64-way  unrolled 
code  by  running  at  different  times  of  the  day;  in  general,  the  average  rates  of  Figure 
B-1  tend  to  be  lower  than  rates  measured  at  other  times.  Small  differences  in 
performance  for  different  unrollings  probably  can  be  attributed  to  memory  loading 
variations. 

It  should  be  noted  that  2-way  unrolling  shows  no  advantage  over  1-way  unrolling  for 
small  matrices;  4-way  unrolling  shows  a  marked  advantage  for  all  sizes.  The 
consistent  degradation  of  64-way  vis-a-vis  32-way  unrolling  is  not  explained. 


It  was  decided  that  16-way  unrolling  offered  a  reasonable  performance-complexity 
compromise,  and  was  adopted  for  use  in  the  Fortran  M*V  factorization  codes  of  Figures 
3  and  5.  See  Appendix  A  for  listing  of  an  unrolled  multiply. 


Figure  B-l .  Effect  of  unrolling  on  M*V  performance 


